Description of the NTU System used for MET-2
Abstract
Named entities form the major components of a document. When we capture these fundamental entities, we can understand a document to some degree. This paper employs different types of information from different levels of text to extract named entities, including character conditions, statistical information, titles, punctuation marks, organization and location keywords, speech-act and locative verbs, a cache and an n-gram model. In the formal run of MET-2, the F-measures P&R, 2P&R and P&2R are 79.61%, 77.88% and 81.42%, respectively.

INTRODUCTION

People, affairs, time, places and things are five basic entities in a document. When we capture these fundamental entities, we can understand a document to some degree. The Natural Language Processing Laboratory (NLPL) in the Department of Computer Science and Information Engineering (CSIE), National Taiwan University (NTU), started studying the named entity extraction problem in 1993. At first, we focused on the extraction of Chinese person names, transliterated person names [1] and organization names [2]. The training data and the testing data in these experiments were selected from three Taiwan newspaper corpora (China Times, Liberty Times News and United Daily News). At the 16th International Conference on Computational Linguistics, Chen and Lee [3] reported that the precision and recall rates for the extraction of Chinese person names, transliterated person names and organization names were (88.04%, 92.56%), (50.62%, 71.93%) and (61.79%, 54.50%), respectively.

We have applied these results to several applications. Chen and Wu [4] considered person names as one of the clues in sentence alignment. Chen and Lee [3] showed an application to anaphora resolution. Chen and Bian [5] proposed a method to construct white pages for Internet/Intranet users automatically. We extract information from World Wide Web documents, including proper nouns, e-mail addresses and home page URLs, and find the relationships among these data.
Chen, Ding and Tsai [6,7] dealt with proper noun extraction for information retrieval.

In MUC-7 and MET-2, we participated in the named entity extraction tasks for both English and Chinese. We extended our previous work on this problem to cover more named entity types, such as locations, date/time expressions, and monetary and percentage expressions. Several issues had to be addressed during this extension. One of the major differences between Chinese and English language processing is that segmentation is required for Chinese. That is, we have to identify word boundaries in Chinese sentences beforehand, which makes Chinese named entity extraction more challenging. In addition, the vocabulary and the Chinese character coding set used in Taiwan differ from those used in China. The documents adopted in MET-2 were selected from newspapers in China, so we have to transform simplified Chinese characters in the GB coding set into traditional Chinese characters in the Big-5 coding set before testing. A word that is known may become unknown due to this transformation. For example, the character "ú" in "ú" (early morning) is used in traditional Chinese characters. However, "D" is used in simplified Chinese characters, and it is also a legal traditional Chinese character that denotes another meaning. In other words, the mapping from GB to Big-5 is "D", which is an unknown word in our dictionary. The difference in vocabulary between China and Taiwan also results in different segmentations.

This paper is organized as follows. Section 2 illustrates the flow of named entity extraction and the summary scores of our team in the MET-2 formal run. Sections 3, 4 and 5 propose methods to extract named people, organizations and locations. Section 6 deals with the remaining named entities, i.e., date/time expressions and monetary and percentage expressions. After each section, we discuss the sources of errors in the formal run. Section 7 concludes the paper.
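As a rough illustration of how the three F-measure variants quoted above relate to a precision/recall pair, the following sketch uses the standard weighted F-measure. It assumes (this is not stated in the text) that P&R, 2P&R and P&2R correspond to beta = 1, beta = 0.5 and beta = 2 respectively, as in the usual MUC-style scoring convention. The function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted F-measure: (beta^2 + 1) * P * R / (beta^2 * P + R).

    beta < 1 weights precision more heavily, beta > 1 weights recall more heavily.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Precision and recall reported above for Chinese person names (in percent).
p, r = 88.04, 92.56

# Assumed mapping of the MET-2 labels to beta values (an assumption, see lead-in).
scores = {
    "P&R": f_measure(p, r, beta=1.0),
    "2P&R": f_measure(p, r, beta=0.5),
    "P&2R": f_measure(p, r, beta=2.0),
}

for label, value in scores.items():
    print(f"{label}: {value:.2f}%")
```

Because recall exceeds precision for this pair, the precision-weighted score is the lowest of the three and the recall-weighted score the highest, the same ordering seen in the MET-2 figures of 77.88%, 79.61% and 81.42%.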
FLOW OF NAMED ENTITY EXTRACTION

The following shows the flow of named entity extraction in the MET-2 formal run.
(1) Transform Chinese texts in GB codes into texts in Big-5 codes.
(2) Segment Chinese texts into a sequence of tokens.
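Steps (1) and (2) can be sketched as follows. This is a minimal illustration, not the NTU system itself: it works on Unicode strings rather than raw GB or Big-5 bytes, the character table `SIMPLIFIED_TO_TRADITIONAL` and the lexicon `LEXICON` are tiny hypothetical stand-ins, and forward maximum matching is swapped in as a simple segmentation method because the text does not say which segmenter was used. The second example uses the character "后", chosen here for illustration, to show how a word known to the dictionary can become unknown after conversion.

```python
# Hypothetical character table standing in for a full GB-to-Big-5 converter.
# "后" is deliberately absent: it is the simplified form of "後" but is also
# a legal traditional character with a different meaning, so a naive
# converter leaves it unchanged.
SIMPLIFIED_TO_TRADITIONAL = {"国": "國", "学": "學"}

# Hypothetical traditional-Chinese lexicon used by the segmenter.
LEXICON = {"中國", "大學", "以後"}


def to_traditional(text):
    """Step (1): map each simplified character to its traditional form."""
    return "".join(SIMPLIFIED_TO_TRADITIONAL.get(ch, ch) for ch in text)


def segment(text, lexicon, max_len=4):
    """Step (2): forward maximum matching against the lexicon.

    At each position the longest lexicon entry is taken; if nothing
    matches, a single character is emitted as an unknown token.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Known words survive the conversion and are segmented as whole tokens.
print(segment(to_traditional("中国大学"), LEXICON))

# "以后" stays "以后" after conversion, which is not in the lexicon,
# so it falls apart into single-character tokens.
print(segment(to_traditional("以后"), LEXICON))
```

A real converter would need a much larger table and some way to resolve characters with more than one traditional counterpart, which is the situation the introduction describes.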
Publication date: 1998